---
title: "NATOLab15"
author: "NTO"
date: "4/29/2019"
output: html_document
---


Team Section


Team Question: What factors give countries or individuals advantages over their competition in the Olympics?

Importance: This question matters because the answer may give both individuals and teams recommendations on how to raise their likelihood of winning a medal and bringing glory to their countries!

Answer/Conclusion: Many factors affect the performance of a country or an individual. We found the relationship between sex and the probability of winning a medal to be the most interesting. The probability graph in David’s section shows that sex has an influence on the probability of winning a medal: the curve is higher for females. This makes sense because women have fewer people competing for each medal. It is also consistent with the finding in David’s section that more and more women are participating in more events, since these events have fewer competitors.

Recommendations: We recommend encouraging women to continue participating in the Olympics. The trend is already moving in this direction, and continuing it would bring more gender equality to the Olympics.


Ethan’s Section


Subquestion: How do countries’ populations affect Olympic performance, and how does this correlation differ across areas of the world?

Importance / Relation to Overall Question: This question contributes to our overall question because it may provide helpful recommendations to countries on how their teams could improve Olympic performance based on population and area of the world.

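library(tidyverse)  # dplyr, tidyr, and readr (for parse_double)

# one row per country with its continent, taken from gapminder
continents <- gapminder::gapminder %>%
  select(country, continent) %>%
  distinct()
# reshape country_stats from one column per year (columns 2 to 302) to one row per country and year
tidy_countries <- country_stats %>%
  gather(seq(2,302), key = "year", value = "population")
tidy_countries$year <- parse_double(tidy_countries$year)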

# attach region names, label missing medals, and rename countries to match the population table
olympics2 <- olympics %>%
  left_join(regions, by = "NOC") %>%
  mutate(Medal = if_else(is.na(Medal), "No Medal", Medal)) %>%
  mutate(country = region) %>%
  mutate(country = if_else(country == "USA", "United States", country)) %>%
  mutate(country = if_else(country == "UK", "United Kingdom", country)) %>%
  mutate(country = if_else(country == "Slovakia", "Slovak Republic", country)) %>%
  mutate(country = if_else(country == "Kyrgyzstan", "Kyrgyz Republic", country)) %>%
  mutate(country = if_else(country == "Macedonia", "Macedonia, FYR", country)) %>%
  mutate(year = Year)
# add populations (inner_join drops unmatched country/year pairs), then add continents
olympics3 <- olympics2 %>%
  inner_join(tidy_countries, by = c("country", "year")) %>%
  select(-c(notes,region, Team, NOC, Games, Age, Weight, Height, Sex, Year, City)) %>%
  left_join(continents, by = "country")
## Warning: Column `country` joining character vector and factor, coercing
## into character vector
diagnose <- anti_join(olympics2, continents, by = "country")
## Warning: Column `country` joining character vector and factor, coercing
## into character vector
  • Tidying: In this part of my analysis I joined a gapminder population dataset with the olympics dataset to get the population of each country for each year. Some countries that do not appear in tidy_countries, such as Chinese Taipei, Puerto Rico, and Singapore, are excluded from this study because their populations are unavailable. Countries that had different names in the two datasets, like the US, the UK, and Slovakia, were renamed with if_else and are included. After diagnosing with anti_join, around 34,000 entries out of 270,000 must be dropped.
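As a minimal sketch of the anti_join diagnosis described above, the following uses two tiny made-up tables to list the rows an inner_join would lose and the share of entries they represent. The names athletes and populations, and all of the values, are hypothetical and are not taken from the lab’s data.

```r
library(dplyr)

# Toy stand-ins for the real tables; every value here is invented.
athletes <- tibble(
  country = c("United States", "United Kingdom", "Chinese Taipei", "Puerto Rico"),
  year    = c(1996, 1996, 1996, 1996)
)
populations <- tibble(
  country    = c("United States", "United Kingdom"),
  year       = c(1996, 1996),
  population = c(2.7e8, 5.8e7)
)

# Rows of `athletes` that have no matching country/year in `populations`.
unmatched <- anti_join(athletes, populations, by = c("country", "year"))

# Share of entries that an inner_join on the same keys would drop.
drop_share <- nrow(unmatched) / nrow(athletes)

unmatched$country
drop_share
```

Inspecting the unmatched country names is also a quick way to spot spelling differences that can be fixed by renaming before the join.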

## [1] 0.2981884
## (Intercept)    pop_prop 
## 0.002970556 0.083985173

  • Transformation: This is the transformation and first graphical / modeling part of my analysis. I made a table in which the two variables of interest are population proportion and medal proportion. Population proportion takes the population of a given country and divides it by the sum of the populations of other countries competing in the same Olympics that year. Medal proportion is the proportion of events in which individuals from a given country medaled in a given year. These proportions are necessary because of confounding variables such as events being added over time and populations rising over time.

  • Findings 1: I first created a basic model and scatterplot of population proportion vs. medal proportion, as well as the corresponding residuals. (Note: each point represents a given country in a given year. I was unable to label the year on the graph, so the years are listed in this paragraph.) I found an overall positive correlation of R = 0.30 between the two variables, which is weak to moderate, and a shallow slope of 0.08. Based on the labels, one of the most noticeable features of this graph is that some European countries (Greece and Germany in 1896, France in 1900, and the UK in 1908) are grouped closely together and far away from the US in 1904 and China in 1996, which may support my belief that the area of the world affects this correlation. It’s also notable that most of these extreme proportions occurred in the earliest years, when there were fewer countries competing and fewer events. Next, I analyze how the continents of the world affect the overall correlation.
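The two proportions and the overall model described above could be computed along these lines. This is only a sketch on invented data: the table country_year and its columns events_medaled and events_held are hypothetical, and the population denominator here is the total over all competing countries (including the country itself), which is one reading of the description.

```r
library(dplyr)

# Hypothetical tidy table: one row per country and year (values are invented).
country_year <- tibble(
  country        = c("A", "B", "C", "A", "B", "C"),
  year           = c(1996, 1996, 1996, 2000, 2000, 2000),
  population     = c(100, 300, 600, 120, 280, 600),
  events_medaled = c(2, 6, 9, 3, 5, 12),
  events_held    = c(20, 20, 20, 25, 25, 25)
)

props <- country_year %>%
  group_by(year) %>%
  mutate(
    pop_prop   = population / sum(population),   # assumption: total includes the country itself
    medal_prop = events_medaled / events_held    # share of that year's events with a medal
  ) %>%
  ungroup()

# Overall correlation and simple linear model, as in the printed output.
r   <- cor(props$pop_prop, props$medal_prop)
fit <- lm(medal_prop ~ pop_prop, data = props)
coef(fit)
```

Working with within-year proportions rather than raw counts keeps years with more events or larger populations from dominating the fit.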

## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##    0.003092     0.217090
## [1] 0.4463364
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   -0.001047     0.336299
## [1] 0.6434126
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   0.0008647    0.0109621
## [1] 0.4353886
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##    0.001948     0.242427
## [1] 0.2253783
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   0.0001996    0.0364840
## [1] 0.3283114
  • Findings 2: After faceting by continent and creating a model for each continent (blue lines) while comparing them to the overall model (black lines), we can see clear distinctions between the 5 continents, though Oceania and Africa are less informative because of their smaller sample sizes and lower visibility. Higher slopes exist in the Americas and Europe, with R values of 0.64 and 0.45, respectively, the two highest of the five continents, meaning the correlation between population and Olympic performance is most prevalent in these two continents. Asia, on the other hand, shows a modest R value of 0.33 but a much shallower slope, as countries in this part of the world appear to perform quite poorly regardless of population. This is likely what weakened the overall correlation so much, especially with China in 1996 (over 1 billion people) getting so few medals. I nested by continent so that the slopes of the lines for all continents are visible, and I also found the corresponding R value for each continent.

  • Explanation: Europe and the Americas likely show a stronger correlation between population proportion and Olympic performance because they have historically included richer countries that invest in sports, so a larger population translates into more medals. Asia, on the other hand, has historically included many poorer countries, some under communist governments, where the focus was likely shifted away from sports. This means a country with a giant population like China won’t necessarily perform better than its neighbors if its top Olympians are also barely trained. Much of China’s population also lives in rural areas.

  • New tools: I used modeling tools such as lm(), nesting, and adding predictions and residuals. Explanations of how I used these tools are included above.
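The nesting workflow mentioned above (one lm() per continent, plus predictions and residuals) can be sketched as follows. The data are simulated, and the group labels "X" and "Y" and the slopes used to generate them are invented for illustration. Only the pattern of nest(), map(), add_predictions(), and add_residuals() is meant to mirror the analysis.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

set.seed(1)
# Simulated data: two groups with different underlying slopes.
df <- tibble(
  continent = rep(c("X", "Y"), each = 20),
  pop_prop  = runif(40, 0, 0.2)
) %>%
  mutate(medal_prop = if_else(continent == "X", 0.3, 0.05) * pop_prop +
           rnorm(40, sd = 0.005))

by_continent <- df %>%
  group_by(continent) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # one linear model per nested data frame
    model = map(data, ~ lm(medal_prop ~ pop_prop, data = .x)),
    slope = map_dbl(model, ~ coef(.x)[["pop_prop"]]),
    r     = map_dbl(data, ~ cor(.x$pop_prop, .x$medal_prop)),
    # attach pred and resid columns from each group's own model
    data  = map2(data, model, ~ add_residuals(add_predictions(.x, .y), .y))
  )

by_continent %>% select(continent, slope, r)
```

Keeping the models in a list-column next to their data makes it straightforward to unnest the predictions later for a faceted plot.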


Anderson’s Section


Subquestion: How big a factor is age in winning medals in the Olympics?

Importance / Relation to Overall Question: Age is an important factor in the Olympics. Depending on your age, you could have a lot of professional experience in a sport or nearly none. By looking at the proportion of medals won at each age, countries can determine the optimal age for winning.

## # A tibble: 35 x 4
##     Year     n num_entries   prop
##    <dbl> <int>       <int>  <dbl>
##  1  1896    50         380 0.132 
##  2  1900   155        1936 0.0801
##  3  1904   134        1301 0.103 
##  4  1906   137        1733 0.0791
##  5  1908   428        3101 0.138 
##  6  1912   691        4040 0.171 
##  7  1920   394        4292 0.0918
##  8  1924   673        5693 0.118 
##  9  1928   829        5574 0.149 
## 10  1932   381        3321 0.115 
## # … with 25 more rows

Findings: I found that the number of older participants decreases over the years. By tidying my data and categorizing participants by age, I developed a graph and a method to check the proportion of participants winning a bronze, silver, gold, or no medal in three different age groups. I also found a way to plot each age category I created separately, along with the type of medal its members win.

New Tools: In this lab I used new nesting tools to help me plot the proportions and the age groups in individual graphs. I also used functions like lm(), add_predictions(), and add_residuals().
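To illustrate the kind of age grouping described in the findings, here is a small sketch that bins ages into three groups and computes the proportion of each medal outcome within a group. The cut-offs (22 and 30), the group labels, and the data are all hypothetical, since the report does not state which boundaries were used.

```r
library(dplyr)

# Invented entries; the age cut-offs below are placeholders, not the lab's.
entries <- tibble(
  Age   = c(17, 19, 24, 26, 28, 31, 35, 40),
  Medal = c("No Medal", "Gold", "Silver", "No Medal",
            "Bronze", "No Medal", "No Medal", "Gold")
)

by_age <- entries %>%
  mutate(age_group = cut(Age, breaks = c(0, 22, 30, Inf),
                         labels = c("young", "prime", "older"))) %>%
  count(age_group, Medal) %>%          # entries per group and medal outcome
  group_by(age_group) %>%
  mutate(prop = n / sum(n)) %>%        # proportion within each age group
  ungroup()

by_age
```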


David’s Section


Subquestion: Does sex have a significant influence on the probability of winning an Olympic medal?

Importance / Relation to Overall Question:

## count participants for each sex and year, dropping the Winter Games years 1994-2014
sex <- olympics %>% group_by(Sex, Year) %>% summarise(participants = n()) %>%
  filter(!Year %in% c(1994, 1998, 2002, 2006, 2010, 2014))
## count entries in each medal category for each sex and year
sex1 <- olympics %>% group_by(Sex, Year) %>% count(Medal)
## proportion of that sex's participants in each medal category (the join keeps only the years in sex)
sexxy <- inner_join(sex1, sex, by = c("Sex", "Year")) %>% mutate(prop = n / participants)
## medalists only: number placing and probability of placing, by sex and year
sassy <- sexxy %>% filter(Medal != "No Medal") %>% group_by(Sex, Year) %>%
  summarize(placing = sum(n), prop = sum(prop))

## try to find the probability that you will place in any event in a given year, given your sex


mod2 <- lm(placing ~ Year * Sex, data = sassy)

## the grid needs only the predictors (Year and Sex); placing is the response
gridtry <- sassy %>% 
  ungroup() %>% 
  data_grid(Year, Sex) %>% 
  gather_predictions(mod2)

ggplot(sassy, aes(Year,placing,colour = Sex)) + 
  geom_point() + 
  geom_line(data = gridtry, aes(y = pred),size = 1)+ggtitle("Time Series on number of medalists for each sex")

Since there can only be 3 medalists (teams or individuals) for each event, an increase in the number of medalists means that more and more events have been added to the Olympics. Women have been included in more and more events as the years progressed. The line of best fit for the females backs up this claim.

ggplot(data = sassy,aes(x = Year, y = prop))+
  geom_line(aes(color = Sex),size = 1.2)+ggtitle("Probability of being a medalist over the years")+ylab("Probability of being a medalist")

## about 65 countries did not take part in the 1980 Olympics, which were held during the Cold War

Findings: From the graph above, it seems that sex does have an effect on the probability that an individual will earn a medal. Women generally have a higher probability of winning a medal because they have fewer people competing for each medal. The 1980 Olympics were hosted in Moscow, in what was then the Soviet Union, and about 65 countries refused to participate. This explains the spike in the probability chart: fewer competitors means a higher likelihood of placing. To summarize, women are being included in more and more Olympic sports, and if you are a woman, there are fewer people competing for those medals, so you have a slightly better chance of earning one.

New Tools: I used commands like filter(), mutate(), inner_join(), and full_join() to tidy the Olympic data in order to find the probability that an individual will place in any event in a given year, given their sex. I used a model of the form y ~ x1 * x2, here placing ~ Year * Sex, to see whether Sex together with Year influences the number of medals given. I used gather_predictions() to plot the line of best fit for each sex. Sex does have an effect on the correlation between year and medals given: the line of best fit for females is noticeably steeper, which means that females are being included in more and more events.
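To show what the interaction term in placing ~ Year * Sex buys, the sketch below fits that formula to invented counts in which the two sexes grow at different rates, then reads each sex’s slope off the coefficients. The data frame toy and its numbers are made up. Only the model formula comes from the text.

```r
# Invented yearly medalist counts; only the model formula mirrors the text.
toy <- data.frame(
  Year = rep(seq(1960, 2000, by = 4), times = 2),
  Sex  = rep(c("F", "M"), each = 11)
)
toy$placing <- ifelse(toy$Sex == "F",
                      100 + 12 * (toy$Year - 1960),   # females: 12 more per year
                      600 +  5 * (toy$Year - 1960))   # males: 5 more per year

fit <- lm(placing ~ Year * Sex, data = toy)
b   <- coef(fit)

# "F" is the baseline level, so its slope is the Year coefficient;
# the male slope adds the Year:SexM interaction term.
slope_F <- b[["Year"]]
slope_M <- b[["Year"]] + b[["Year:SexM"]]
c(F = slope_F, M = slope_M)
```

Without the interaction (placing ~ Year + Sex) both sexes would be forced to share one slope, so a steeper female trend could not show up in the fitted lines.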


Ryan’s Section


Subquestion: In 1980, how did height affect medal winnings in Canoeing events?

Importance / Relation to Overall Question: You would expect height to affect performance in Canoeing events, since the taller someone is, the more likely they are to have long, strong strokes.

## # A tibble: 108 x 3
## # Groups:   Event [27]
##    Event                                          Medal        n
##    <chr>                                          <chr>    <int>
##  1 Canoeing Men's Canadian Doubles, 1,000 metres  Bronze      38
##  2 Canoeing Men's Canadian Doubles, 1,000 metres  Gold        38
##  3 Canoeing Men's Canadian Doubles, 1,000 metres  No Medal   354
##  4 Canoeing Men's Canadian Doubles, 1,000 metres  Silver      38
##  5 Canoeing Men's Canadian Doubles, 10,000 metres Bronze       8
##  6 Canoeing Men's Canadian Doubles, 10,000 metres Gold         8
##  7 Canoeing Men's Canadian Doubles, 10,000 metres No Medal    36
##  8 Canoeing Men's Canadian Doubles, 10,000 metres Silver       8
##  9 Canoeing Men's Canadian Doubles, 500 metres    Bronze      18
## 10 Canoeing Men's Canadian Doubles, 500 metres    Gold        18
## # … with 98 more rows
##  (Intercept)       Height 
## -0.728978886  0.006187414

Findings: There is a slight positive relationship between height and performance (the fitted Height coefficient is about 0.006), so we can tentatively conclude that height might slightly affect performance. This addresses the main question by suggesting that height is at most weakly related to performing better in the Olympics.

New Tools: I used data_grid(), add_predictions(), lm(), and ifelse() statements.
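One way the tools listed above can fit together is sketched below: ifelse() turns the medal column into a 0/1 indicator, lm() regresses it on height, and data_grid() with add_predictions() builds the fitted line. The data are invented, and coding the outcome as 0/1 is an assumption about how the original model was set up, not something the report states.

```r
library(dplyr)
library(modelr)

# Invented heights (cm) and medal outcomes, for illustration only.
canoe <- tibble(
  Height = c(170, 174, 178, 180, 183, 185, 188, 191),
  Medal  = c("No Medal", "No Medal", "Bronze", "No Medal",
             "Silver", "No Medal", "Gold", "Gold")
) %>%
  mutate(medaled = ifelse(Medal == "No Medal", 0, 1))  # assumption: 0/1 outcome

fit <- lm(medaled ~ Height, data = canoe)

# One row per observed height, with the model's prediction attached.
grid <- canoe %>%
  data_grid(Height) %>%
  add_predictions(fit)

coef(fit)
grid
```

A small positive Height coefficient in such a model reads as a slightly higher fitted chance of medaling per extra unit of height, which is why a slope near zero supports only a weak conclusion.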


Arie’s Section


Subquestion: Does a country do better or worse depending on the season of the Olympics?

Importance / Relation to Overall Question: This question is important and interesting because countries will be able to understand which season they do better in. This will help them improve during their weak season.

## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##       31.86       346.59
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##      -6.136      606.551
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##       -30.5        798.8
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##      -4.118      432.732

Findings: Since there were so many countries, I focused on four with different climates (USA, Russia, China, and Brazil). The graphs above show that warmer-climate countries perform better in the summer and cooler-climate countries perform better in the winter. This answers my question, “Does a country do better or worse depending on the season of the Olympics?” It relates to the overall question because it shows that countries with warmer climates have an advantage in the summer and countries with cooler climates have an advantage in the winter.

New Tools: I used plot_grid(), map(), nest(), lm().
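The summer versus winter comparison in the findings could be tabulated roughly as follows. This is a sketch on invented rows, not the models printed above: the country codes "AAA" and "BBB" are placeholders, and the presence of a Season column in the data is an assumption.

```r
library(dplyr)
library(tidyr)

# Invented medal rows; country codes are placeholders.
medals <- tibble(
  NOC    = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB"),
  Season = c("Summer", "Summer", "Winter", "Winter",
             "Summer", "Summer", "Winter", "Winter"),
  Medal  = c("Gold", "Gold", "Silver", "Bronze",
             "Bronze", "Silver", "Gold", "Gold")
)

# Share of each country's medals that are gold, side by side for each season.
gold_by_season <- medals %>%
  group_by(NOC, Season) %>%
  summarise(gold_prop = mean(Medal == "Gold"), .groups = "drop") %>%
  pivot_wider(names_from = Season, values_from = gold_prop)

gold_by_season
```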


Lab 2 Reflections


  • Team: At the beginning of the semester our goals were “We aspire to enhance our technical, communication, and project management skills. We also aspire to be a rad team but not team number one.” We think we did a pretty good job of meeting our goals. If we could travel back in time, we would tell our team to keep being rad, meet up more often, and start discussing reading materials before taking the iRAT and tRAT.

  • Ethan: After learning much more about the data science field, my 6-month goal after graduation has changed, as it may not be necessary for me to get a master’s degree in data science. I hope to gain enough knowledge and experience through my major, online self-learning, and internships to become a data scientist soon after I graduate. My 5-year goal has not changed, as I still hope to travel and work in the big city! I learned so much R in this course, and it has further ignited my confidence and my passion for data science. If I were to go back to the beginning of the semester, I’d tell myself to keep working on personal projects so I can enhance my skills quickly.

  • David:

  • Ryan: My 6-month goal after graduating was to find a job as a data scientist. This did not really change. My 5-year goal after graduating was to be working remotely and traveling. This also did not really change. I learned base R, for the most part. I also learned a little bit more about what it’s like to work on a team. Oh, and I learned GitKraken. If I could give myself advice, it would be to start asking more questions, keep studying more for the tRATs, and stop second-guessing myself.

  • Arie: After taking this class, my 6-month and 5-year goals remain the same. However, I think I want to start working on my business plan as I secure a data science job. I learned so much in this class! I had never programmed before, and now I can do some programming in R. I learned about permutation tests. If I were to go back and give myself advice, I would tell myself to do all the readings and exercises thoroughly, and practice R more outside of class.

  • Anderson: After learning a lot about the basic functions and formats in the data science field, my 6-month goal has changed. I will try to minor, if not major, in data science and learn as many useful and necessary languages as I can for computing and tidying data. My 5-year goal has remained somewhat the same. I still plan on living in the city and hope to have a fulfilling job. Throughout this course I have gained a lot of knowledge about computing and feel comfortable with the R language. If I could go back and give myself advice, it would be to review the material and practice each chapter on my own.


Who Did What


  • Ethan: Individual section, helped with other individual sections, formatting

  • David:

  • Ryan: Individual section

  • Arie:

=======

title: “NATOLab15” author: “NTO” date: “4/29/2019” output: html_document —


Team Section


Team Question: What factors give countries or individuals advantages over their competition in the Olympics?

Importance: This is an important question as it may provide both individuals and teams recommendations on how they can raise their likelyhood of winning a medal and bringing glory to their countries!

Answer/Conclusion: There are many factors that affect the performace of a country or indiviudual. We found that

Recommendations:


Ethan’s Section


Subquestion: How do countrys’ populations affect olympic performance and how does this correlation differ in different areas of the world?

Importance / Relation to Overall Question: This question contributes to out overall question as it may provide helpful recommendations to countries on how their teams could improve olympic performance based on population and area of the world.

continents <- gapminder::gapminder %>%
  select(country, continent) %>%
  distinct()
tidy_countries <- country_stats %>%
  gather(seq(2,302), key = "year", value = "population")
tidy_countries$year <- parse_double(tidy_countries$year)

olympics2 <- olympics %>%
  left_join(regions, by = "NOC") %>%
  mutate(Medal = if_else(is.na(Medal), "No Medal", Medal)) %>%
  mutate(country = region) %>%
  mutate(country = if_else(country == "USA", "United States", country)) %>%
  mutate(country = if_else(country == "UK", "United Kingdom", country)) %>%
  mutate(country = if_else(country == "Slovakia", "Slovak Republic", country)) %>%
  mutate(country = if_else(country == "Kyrgyzstan", "Kyrgyz Republic", country)) %>%
  mutate(country = if_else(country == "Macedonia", "Macedonia, FYR", country)) %>%
  mutate(year = Year)
olympics3 <- olympics2 %>%
  inner_join(tidy_countries, by = c("country", "year")) %>%
  select(-c(notes,region, Team, NOC, Games, Age, Weight, Height, Sex, Year, City)) %>%
  left_join(continents, by = "country")
## Warning: Column `country` joining character vector and factor, coercing
## into character vector
diagnose <- anti_join(olympics2, continents, by = "country")
## Warning: Column `country` joining character vector and factor, coercing
## into character vector
  • Tidying: this is the tidying section of my analysis, as I joined a gapminder dataset with the olympics dataset to get populations for each country for each year. Some countries that are not recognized in tidy_countries, such as Chinese Taipei, Puerto Rico, and Singapore, won’t appear in this study as their populations are unavailable. Countries that had different names in the two datasets, like the US, the UK, and Slovakia, were renamed with if_else and will appear in this study. After diagnosing from anti_join, around 34000 entries out of 270000 must be dropped.

## [1] 0.2981884
## (Intercept)    pop_prop 
## 0.002970556 0.083985173

  • Transformation: this is the transformation and first graphical / modeling section of my analysis. I made a table in which the two variables of interest for correlation are population proportion and medal proportion. Population proportion takes the population of a given country and divides it by the sum of the populations of other countries competing in the same olympics for that year, and medal proportion is the proportion of events in which individuals from given country medaled for a given year. These proportions are necessary because of confounding variables such as events being added over time, populations rising over time, etc.

  • Findings 1: I first created a basic model and scatterplot of population proportion vs. medal proportion, as well as the corresponding residuals (note: points represent given countries AND given years, though I was unable to label the year on the graph. I put them in this paragraph). I found that there exists an overall positive moderate correlation value of R = 0.30 between the two variables and a weak slope of 0.08. One of the most noticeable ideas about this graph is that based on the labels, we can see that some European countries (Greece and Germany in 1896, France in 1900, and the UK in 1908) are grouped closely together and far away from the US in 1904 and China in 1996, possibly confirming my belief that areas of the world may affect this correlation. It’s also noteable that these extreme proportions took place in the earliest years when there were less countries competing and less events. Now it’s time to analyze how continents of the world affect my overall correlation…

## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##    0.003092     0.217090
## [1] 0.4463364
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   -0.001047     0.336299
## [1] 0.6434126
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   0.0008647    0.0109621
## [1] 0.4353886
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##    0.001948     0.242427
## [1] 0.2253783
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   0.0001996    0.0364840
## [1] 0.3283114
  • Findings 2: After faceting by continent and creating models for each continent (blue lines) while comparing them to the overall model (black lines), we can see that there are clear distinctions between the 5 continents, though Oceania and Africa are rather insignificant due to lower sample size and less visibility. Higher slopes exist in the Americas and Europe, with very strong R values of 0.64 and 0.45, respectively, meaning the correlation between population and olympic performance is very prevelent in these two continents. Asia, on the other hand, shows relatively strong R value of 0.33 but a much weaker slope as countries in this part of the world appear to perform quite poorly regardless of population. This is likely what weakened the overall correlation so much, especially with China in 1996 (over 1 billion people), getting so few medals. I nested by continent so slopes of lines for all continents are visible, and also found all the corresponding filtered R values.

  • Explanation: Europe and the Americas likely show strong correlation between population proportion and olympic performance because they have historically included richer countries that care about sports, meaning population is a significant factor. Asia, on the other hand, has historically included many third world countries under communism, shifting their focus away from sports, meaning a country with a giant population like China won’t necessarily perform better than its neighbors if they also have barely trained top olympians. Much of China’s population also lives in rural areas.

  • New tools: I used modeling tools such as lm(), nesting, and adding predictions and residuals. Explanations of how I used these tools are included above.


Anderson’s Section


Subquestion: How big of a factor is age in terms of winning medals in the olympics?

Importance / Relation to Overall Question: Age is an important factor in the olympics. Depending on the age you could have a lot of professional experience in a sport or nearly none. Looking at the proportion of age and medals won, countries can determine the optimal age for winning.

## # A tibble: 35 x 4
##     Year     n num_entries   prop
##    <dbl> <int>       <int>  <dbl>
##  1  1896    50         380 0.132 
##  2  1900   155        1936 0.0801
##  3  1904   134        1301 0.103 
##  4  1906   137        1733 0.0791
##  5  1908   428        3101 0.138 
##  6  1912   691        4040 0.171 
##  7  1920   394        4292 0.0918
##  8  1924   673        5693 0.118 
##  9  1928   829        5574 0.149 
## 10  1932   381        3321 0.115 
## # … with 25 more rows

Findings: In my findings I have found out that the amount of older participants start to decrease over the years. Through tidying up my data and categorizing participants by age I have developed a graph and a method to check the proportion of winning a bronze, silver, gold or no medal for three different age groups. I have also found a way to plot individual sections of the age categories I created and the type of medal they win.

New Tools: In this lab I used new nesting tools to help me plot the proportions and the age groups into individual graphs. I also put in functions like lm(), add_predictions() and add_residuals().


David’s Section


Subquestion: Does sex have a significant influence on the probability of winning an olympic medal?

Importance / Relation to Overall Question:

sex <- olympics %>% group_by(Sex,Year) %>% summarise(participants = n()) %>% filter(Year != 1994,Year != 1998,Year != 2002,Year != 2006,Year != 2010,Year != 2014)
## groups by given variables and then adds a column of today participants in that event
sex1 <- olympics %>% group_by(Sex,Year) %>% count(Medal)
sexxy <- inner_join(sex1,sex) %>% mutate(prop = n/participants)
sassy <- sexxy %>% filter(Medal != "No Medal")%>% group_by(Sex,Year) %>% summarize(placing = sum(n),prop = sum(prop)) %>% filter(Year != 1994,Year != 1998,Year != 2002,Year != 2006,Year != 2010,Year != 2014)

## try to find the probalilty you will place in any event for a given year, given your sex


mod2 <- lm(placing ~ Year * Sex, data = sassy)

gridtry <- sassy %>% 
 data_grid(placing,Year) %>% 
  gather_predictions(mod2)

ggplot(sassy, aes(Year,placing,colour = Sex)) + 
  geom_point() + 
  geom_line(data = gridtry, aes(y = pred),size = 1)+ggtitle("Time Series on number of medalists for each sex")

Since there can nonly be 3 medalists per team/individal fo reach event, an increase in medals means that more and more games have been added to the olympics. Women have been included in more and more events as the years progressed.

ggplot(data = sassy,aes(x = Year, y = prop))+
  geom_line(aes(color = Sex),size = 1.2)+ggtitle("Number of Medalists over the years")+ylab("Probality of being a Medalist")

see <- sassy %>% filter(Year < 1970)
## 68 countries did not partake in 1980 olympics # near the end of the cold war

Findings:

New Tools:


Ryan’s Section


Subquestion: In 1980, how does height affect winnings for Canoeing sports?

Importance / Relation to Overall Question: You would think that height would affect the performance of Canoeing Sports, as taller someone is, the more likely they are to have long, strong strokes.

## # A tibble: 108 x 3
## # Groups:   Event [27]
##    Event                                          Medal        n
##    <chr>                                          <chr>    <int>
##  1 Canoeing Men's Canadian Doubles, 1,000 metres  Bronze      38
##  2 Canoeing Men's Canadian Doubles, 1,000 metres  Gold        38
##  3 Canoeing Men's Canadian Doubles, 1,000 metres  No Medal   354
##  4 Canoeing Men's Canadian Doubles, 1,000 metres  Silver      38
##  5 Canoeing Men's Canadian Doubles, 10,000 metres Bronze       8
##  6 Canoeing Men's Canadian Doubles, 10,000 metres Gold         8
##  7 Canoeing Men's Canadian Doubles, 10,000 metres No Medal    36
##  8 Canoeing Men's Canadian Doubles, 10,000 metres Silver       8
##  9 Canoeing Men's Canadian Doubles, 500 metres    Bronze      18
## 10 Canoeing Men's Canadian Doubles, 500 metres    Gold        18
## # … with 98 more rows
##  (Intercept)       Height 
## -0.728978886  0.006187414

Findings: There is a slight positive correlation between performance and height, so we can kind of conlcude that ones height might slighlty impact their performance. This direclty answers the main question by suggesting that there might be a poor correlation between height and performing better in the olympics.

New Tools: I used data_grid, add_predictions, lm, and ifelse statements


Arie’s Section


Subquestion: Does a country do better or worse depending on the season of the olympics?

Importance/ elation to Overall Question:: This Question is important and interesting because countries will be able to understand which season they do better in. This will help them improve during their weak season.

## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##       31.86       346.59
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##      -6.136      606.551
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##       -30.5        798.8
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##      -4.118      432.732

Findings: Since there were so many countires, I focused on places with different climates (USA, Russia, China, and Brazil). The grapphs above show that warmer climate countries perform better in the summer, coolor climate countires perform better in the winter. This answers my quesiton, “Does a country do better or worse depending on the season of the olympics?” This relates to the overall quesiton because, this shows countries with warner climates have an advantage in the summer and countries with cooler climates have the advantage in the winter.

New Tools: I used plot_grid(), map(), nest(), lm().


Lab 2 Reflections


  • Team: At the begging of the semester our goals were “We aspire to enhance our technical, communication, and project management skills. We also aspire to be a rad team but not team number one.” We think we did a pretty good job in meeting our goals. If we could travel back in time we would tell our team to keep being rad, meet up more often, and start discussing reading materials before taking the irat and trat.

  • Ethan: After learning much more about the data science field, my 6-month goal after graduation has changed, as it may not be necessary for me to get a master’s degree in data science. I hope to gain enough knowledge and experience through my major, online self-learning, and internships to become a data scientist soon after I graduate. My 5-year goal has not changed, as I still hope to travel and work in the big city! I learned so much R in this course, and it has further ignited my confidence and my passion for data science. If I were to go back to the beginning of the semester, I’d tell myself to keep working on personal projects so I can enhance my skills quickly.

  • David:

  • Ryan: My 6-month goal after graduating was to find a job as a data scientist. This did not really change. My 5-year goal after graduating was to be working remotely and traveling. This also did not really change. I learned base R, for the most part. I also learned a little more about what it’s like to work on a team. Oh, and I learned GitKraken. If I could give myself advice, it would be to start asking more questions, keep studying for the tRATs, and stop second-guessing myself.

  • Arie: After taking this class, my 6-month and 5-year goals remain the same. However, I think I want to start working on my business plan as I secure a data science job. I learned so much in this class! I’d never programmed before, and now I can do some programming in R. I learned about permutation tests. If I were to go back and give myself advice, I would tell myself to do all the readings and exercises thoroughly, and practice R more outside of class.

  • Anderson: After learning a lot about the basic functions and formats in the data science field, my 6-month goal has changed. I will try to minor, if not major, in data science and learn as many useful and necessary languages as I can to compute and tidy data. My 5-year goal has remained somewhat the same. I still plan on living in the city and hope to have a fulfilling job. Throughout this course I have gained a lot of knowledge about computing and feel comfortable with the R language. If I could go back and give myself advice, it would be to review the material and practice each chapter on my own.


Who Did What


  • Ethan: Individual section, helped with other individual sections, formatting

  • David:

  • Ryan: Individual section

  • Arie: Individual section, team findings/conclusion, recommendations, team reflection

  • Anderson: Individual section, tidying

======= NATOLab15

Team Section


Team Question: What factors give countries or individuals advantages over their competition in the Olympics?

Importance: This is an important question, as it may provide both individuals and teams with recommendations on how they can raise their likelihood of winning a medal and bringing glory to their countries!

Answer/Conclusion:

Recommendations:


Ethan’s Section


Subquestion: How do countries’ populations affect Olympic performance, and how does this correlation differ in different areas of the world?

Importance / Relation to Overall Question: This question contributes to our overall question, as it may provide helpful recommendations to countries on how their teams could improve Olympic performance based on population and area of the world.

library(tidyverse)  # supplies the dplyr, tidyr, and readr functions used below

# Look-up table: one row per country with its continent.
continents <- gapminder::gapminder %>%
  select(country, continent) %>%
  distinct()

# Reshape the population table from wide (one column per year) to long.
tidy_countries <- country_stats %>%
  gather(seq(2,302), key = "year", value = "population")
tidy_countries$year <- parse_double(tidy_countries$year)

# Attach region names, label non-medalists, and rename countries
# so that they match the names used in the population table.
olympics2 <- olympics %>%
  left_join(regions, by = "NOC") %>%
  mutate(Medal = if_else(is.na(Medal), "No Medal", Medal)) %>%
  mutate(country = region) %>%
  mutate(country = if_else(country == "USA", "United States", country)) %>%
  mutate(country = if_else(country == "UK", "United Kingdom", country)) %>%
  mutate(country = if_else(country == "Slovakia", "Slovak Republic", country)) %>%
  mutate(country = if_else(country == "Kyrgyzstan", "Kyrgyz Republic", country)) %>%
  mutate(country = if_else(country == "Macedonia", "Macedonia, FYR", country)) %>%
  mutate(year = Year)

# Keep only rows with a matching population record, drop unused columns,
# then add the continent for each country.
olympics3 <- olympics2 %>%
  inner_join(tidy_countries, by = c("country", "year")) %>%
  select(-c(notes,region, Team, NOC, Games, Age, Weight, Height, Sex, Year, City)) %>%
  left_join(continents, by = "country")

# Rows of olympics2 whose country has no match in the continents table.
diagnose <- anti_join(olympics2, continents, by = "country")
## Warning: Factor `continent` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `continent` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## [1] 0.2981884
## (Intercept)    pop_prop 
## 0.002970556 0.083985173

## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##    0.003092     0.217090
## [1] 0.4463364
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   -0.001047     0.336299
## [1] 0.6434126
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   0.0008647    0.0109621
## [1] 0.4353886
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##    0.001948     0.242427
## [1] 0.2253783
## 
## Call:
## lm(formula = medal_prop ~ pop_prop, data = df)
## 
## Coefficients:
## (Intercept)     pop_prop  
##   0.0001996    0.0364840
## [1] 0.3283114

Anderson’s Section


Subquestion: How big of a factor is age in terms of winning medals in the olympics?

Importance / Relation to Overall Question: Age is an important factor in the Olympics. Depending on their age, an athlete could have a lot of professional experience in a sport or nearly none. By looking at the proportion of medals won at each age, countries can determine the optimal age for winning.

## # A tibble: 35 x 4
##     Year     n num_entries   prop
##    <dbl> <int>       <int>  <dbl>
##  1  1896    50         380 0.132 
##  2  1900   155        1936 0.0801
##  3  1904   134        1301 0.103 
##  4  1906   137        1733 0.0791
##  5  1908   428        3101 0.138 
##  6  1912   691        4040 0.171 
##  7  1920   394        4292 0.0918
##  8  1924   673        5693 0.118 
##  9  1928   829        5574 0.149 
## 10  1932   381        3321 0.115 
## # ... with 25 more rows

Findings: I found that the number of older participants decreases over the years. By tidying my data and categorizing participants by age, I developed a graph and a method to check the proportion of participants winning a bronze, silver, gold, or no medal for three different age groups. I also found a way to plot individual sections of the age categories I created and the type of medal they win.

New Tools: In this lab I used new nesting tools to help me plot the proportions and the age groups into individual graphs. I also put in functions like lm(), add_predictions() and add_residuals().
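
As a rough illustration of the age-group proportions described above, the sketch below bins a handful of made-up athletes into three groups and computes the share of each medal outcome within each group. The data, the cut points (25 and 35), and the group labels are all assumptions chosen for the example; the lab's actual breaks are not stated.

```r
library(dplyr)

# Hypothetical toy data (NOT the lab's data).
toy <- tibble(
  Age = c(19, 22, 24, 27, 31, 35, 38, 41),
  Medal = c("No Medal", "Gold", "No Medal", "Silver",
            "Bronze", "No Medal", "No Medal", "Gold")
)

age_props <- toy %>%
  # Assumed cut points: (0, 25], (25, 35], (35, Inf)
  mutate(age_group = cut(Age,
                         breaks = c(0, 25, 35, Inf),
                         labels = c("25 and under", "26-35", "over 35"))) %>%
  count(age_group, Medal) %>%
  group_by(age_group) %>%
  mutate(prop = n / sum(n)) %>%   # share of each outcome within the age group
  ungroup()

print(age_props)
```

Because the proportions are computed within each group, they sum to 1 per age group, which makes groups of different sizes comparable.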


David’s Section


Subquestion: Does sex have a significant influence on the probability of winning an olympic medal?

Importance / Relation to Overall Question:

Since there can only be 3 medalists per team/individual for each event, an increase in medals means that more and more events have been added to the Olympics. Women have been included in more and more events as the years progressed. The line of best fit for the females backs up this claim.

Findings: From the graph above, it seems that sex does have an effect on the probability that an individual will earn a medal. Women generally have a higher probability of winning a medal, as they have fewer people competing for each medal. The 1980 Olympics were hosted in Moscow, and 68 countries refused to participate. This explains the spike in the probability chart: fewer people means a higher likelihood of placing. To summarize, women are being included in more and more Olympic sports, and if you are a woman, there are fewer people competing for those medals, so you have a slightly better chance of earning a medal.

New Tools: I used commands like filter(), mutate(), inner_join(), and full_join() to tidy the Olympic data in order to find the probability that an individual will place in any event for a given year, given their sex. I used the model y ~ x1*x2, or placing ~ year * Sex, to see if Sex along with Year influences the number of medals given. I used gather_predictions() to plot the lines of best fit for each sex. Sex does have an effect on the correlation between year and medals given. The line of best fit for females is significantly steeper. This means that females are being included in more and more events.
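
To make the role of the interaction term concrete, here is a small sketch comparing `placing ~ year + Sex` (same slope for both sexes) with `placing ~ year * Sex` (a separate slope per sex). The `toy` data and its values are hypothetical and chosen only so that the two slopes differ; `data_grid()` and `gather_predictions()` come from the modelr package.

```r
library(dplyr)
library(modelr)

# Hypothetical toy data (NOT the lab's data): medal probability by year and sex.
toy <- tibble(
  year = rep(c(1960, 1980, 2000), times = 2),
  Sex = rep(c("F", "M"), each = 3),
  placing = c(0.10, 0.14, 0.18, 0.12, 0.13, 0.14)
)

mod_add <- lm(placing ~ year + Sex, data = toy)  # parallel lines
mod_int <- lm(placing ~ year * Sex, data = toy)  # slope may differ by sex

# One row per (year, Sex, model) with the fitted value in `pred`.
preds <- toy %>%
  data_grid(year, Sex) %>%
  gather_predictions(mod_add, mod_int)

# The interaction coefficient is the difference between the male and female slopes.
print(coef(mod_int)[["year:SexM"]])
```

In this toy example the interaction coefficient is negative, meaning the female line is steeper, which is the pattern the section reports for the real data.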


Ryan’s Section


Subquestion: In 1980, how does height affect winnings for Canoeing sports?

Importance / Relation to Overall Question: You would think that height would affect performance in canoeing sports, as the taller someone is, the more likely they are to have long, strong strokes.

## # A tibble: 108 x 3
## # Groups:   Event [27]
##    Event                                          Medal        n
##    <chr>                                          <chr>    <int>
##  1 Canoeing Men's Canadian Doubles, 1,000 metres  Bronze      38
##  2 Canoeing Men's Canadian Doubles, 1,000 metres  Gold        38
##  3 Canoeing Men's Canadian Doubles, 1,000 metres  No Medal   354
##  4 Canoeing Men's Canadian Doubles, 1,000 metres  Silver      38
##  5 Canoeing Men's Canadian Doubles, 10,000 metres Bronze       8
##  6 Canoeing Men's Canadian Doubles, 10,000 metres Gold         8
##  7 Canoeing Men's Canadian Doubles, 10,000 metres No Medal    36
##  8 Canoeing Men's Canadian Doubles, 10,000 metres Silver       8
##  9 Canoeing Men's Canadian Doubles, 500 metres    Bronze      18
## 10 Canoeing Men's Canadian Doubles, 500 metres    Gold        18
## # ... with 98 more rows
##  (Intercept)       Height 
## -0.728978886  0.006187414

Findings: There is a slight positive correlation between performance and height, so we can tentatively conclude that one’s height might slightly impact their performance. This directly addresses the main question by suggesting that there is only a weak correlation between height and performing better in the Olympics.

New Tools: I used data_grid, add_predictions, lm, and ifelse statements
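
A model of this kind can be sketched as below. It is only an illustration under assumptions: the `toy` data are made up, and coding the outcome as 0/1 with `ifelse()` before fitting `lm()` is a guess at how the coefficients shown above were produced, not a confirmed description of the original code.

```r
library(dplyr)
library(modelr)

# Hypothetical toy data (NOT the lab's data).
toy <- tibble(
  Height = c(170, 175, 180, 185, 190, 195),
  Medal = c("No Medal", "No Medal", "Bronze", "No Medal", "Gold", "Silver")
) %>%
  # Assumed coding: 1 if the athlete won any medal, 0 otherwise.
  mutate(won = ifelse(Medal == "No Medal", 0, 1))

# Linear model of the 0/1 outcome on height.
mod <- lm(won ~ Height, data = toy)

# Predictions over the observed heights, ready for plotting as a fitted line.
grid <- toy %>%
  data_grid(Height) %>%
  add_predictions(mod)

print(coef(mod))
```

A positive `Height` coefficient here means the predicted chance of a medal rises with height; how small that coefficient is indicates how weak the relationship is.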


Arie’s Section


Subquestion: Does a country do better or worse depending on the season of the olympics?

Importance / Relation to Overall Question: This question is important and interesting because countries will be able to understand which season they do better in. This will help them improve during their weak season.

## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##       31.86       346.59
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##      -6.136      606.551
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##       -30.5        798.8
## 
## Call:
## lm(formula = Gold_Medal_count ~ gold_prop, data = df)
## 
## Coefficients:
## (Intercept)    gold_prop  
##      -4.118      432.732

Findings: Since there were so many countries, I focused on places with different climates (USA, Russia, China, and Brazil). The graphs above show that warmer-climate countries perform better in the summer and cooler-climate countries perform better in the winter. This answers my question, “Does a country do better or worse depending on the season of the Olympics?” It relates to the overall question because it shows that countries with warmer climates have an advantage in the summer and countries with cooler climates have an advantage in the winter.

New Tools: I used plot_grid(), map(), nest(), lm().


Lab 2 Reflections



Who Did What


>>>>>>> master